Document Classification Using Layout Analysis
نویسندگان
چکیده
This paper describes methods for document image classification at the spatial layout level. The goal is to develop fast algorithms for initial document type classification without OCR, which can then be verified using more elaborate methods based on more detailed geometric and syntactic models. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. We demonstrate the usefulness of these features derived from interval coding, in a hidden Markov model based page layout classification system that is trainable and extendible.
منابع مشابه
Content-free Document Genre Classification using First Order Random Graphs
We approach the general problem of machineprinted document genre classification using contentfree layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the lay...
متن کاملDocument Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
متن کاملPage Layout Classification Technique for Biomedical Documents
The structural layout information of scanned document pages is valuable for a wide range of document processing applications such as automatic document searching, document delivery and automated data entry. This paper describes the classification of scanned document pages into different classes of physical layout structures. The page layout classification technique proposed in this paper uses a...
متن کاملText Classification and Layout Analysis of Paper Fragments∗
Document image analysis such as text classification and layout analysis allow for the automated extraction of document properties. In general these methodologies are pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets so that an automated clustering of documents can be performed. First, localized words are c...
متن کاملA Pattern-Based Method for Document Structure Recognition
One of the main goals of the CIDRE1 project is the design of an interactive document recognition system able to improve with use. In previous work, members of this project have already described software architecture issues [1, 6] as well as font and logical structure recognition algorithms [8, 2]. In this paper, we present a new method for document structure recognition based on geometrical an...
متن کامل